CryoMAE: Few-Shot Cryo-EM Particle Picking with Masked Autoencoders

Cryo-electron microscopy (cryo-EM) has emerged as a pivotal technology for determining the architecture of cells, viruses, and protein assemblies at near-atomic resolution. Particle picking, a key step in cryo-EM, traditionally relies either on laborious manual selection or on automated methods that are sensitive to low signal-to-noise ratio (SNR) and varied particle orientations. Furthermore, existing neural network (NN)-based approaches often require extensive labeled datasets, limiting their practicality. To overcome these obstacles, we introduce cryoMAE, a novel approach based on few-shot learning that harnesses the capabilities of Masked Autoencoders (MAE) to enable efficient selection of single particles in cryo-EM images. In contrast to conventional NN-based techniques, cryoMAE requires only a minimal set of positive particle images for training yet demonstrates high performance in particle detection. Furthermore, a self-cross similarity loss ensures distinct features for particle and background regions, thereby enhancing the discrimination capability of cryoMAE. Experiments on large-scale cryo-EM datasets show that cryoMAE outperforms existing state-of-the-art (SOTA) methods, improving 3D reconstruction resolution by up to 22.4%.


Introduction
Cryo-EM is vital for obtaining high-resolution images of biological entities, such as cells, viruses, and proteins, at cryogenic temperatures, significantly minimizing radiation damage. It has revolutionized structural biology, especially through single-particle analysis (SPA), allowing for the detailed examination of molecular structures in their near-native state [12]. The process starts with sample preparation, where specimens are vitrified in a thin ice layer to maintain their native state. Researchers then use a transmission electron microscope to gather multiple 2D projection images from different angles. Image processing includes denoising and identifying particles for 3D reconstruction. Fig. 1 presents a simplified workflow of SPA using cryo-EM [25].
Particle picking is a pivotal step in cryo-EM for isolating individual protein particles from micrographs for further analysis. The quality of particle picking significantly influences the accuracy and resolution of the reconstructed particle structure in the following steps. Challenges in particle picking include the low SNR and varied particle orientations in cryo-EM micrographs, necessitating a large sample size for accurate 3D reconstructions [1]. Moreover, manual picking is inefficient, time-consuming, labor-intensive, error-prone, and introduces dataset inconsistencies [4]. Mis-identifications, or false positives, further compromise reconstruction quality. These issues highlight the need for improved particle selection techniques to enhance both the efficiency of particle identification and the overall quality of cryo-EM reconstructions, emphasizing the reduction of false positives and the increase of true positives [11].
Various semi-automated and automated cryo-EM particle picking methods have been developed in response to this need. Traditional methods are categorized into template-free [13] and template-based methods [14,16,17,19]. Template-free methods like the Difference of Gaussians (DoG) [21] are noise-sensitive and less effective for irregular particles. Template-based approaches struggle with particle variability and are ill-suited for novel structures, limiting their efficacy in complex cryo-EM analysis. With the advent of deep learning, NN-based particle picking methods [1,22,23,26] have been proposed, marking a significant evolution in the field. These advanced techniques leverage the powerful pattern recognition capabilities of deep learning models to enhance the accuracy and efficiency of particle picking. Among these methods, crYOLO [22] and Topaz [1] are notable for their widespread application. While crYOLO is recognized for its efficiency in particle detection, it occasionally misses real particles. Topaz, though capable of identifying particles with limited labeled data, is susceptible to false positives and duplicates. Despite claims of minimal data requirements, these methods still often require large-scale labeled datasets for improved performance. Moreover, they exhibit limited generalization to unseen data, restricting their applicability in diverse cryo-EM research settings.
† Chentianye Xu and Xueying Zhan contributed equally to this work.
In this study, we present cryoMAE, a cryo-EM particle picking approach inspired by MAE [7]. Leveraging the few-shot learning paradigm, cryoMAE first learns representative particle features efficiently from a limited set of cryo-EM particle regions. It then detects and extracts particles from query micrographs by comparing the latent features generated for exemplars against those from regions within the query micrographs. The operation of cryoMAE unfolds in two distinct stages. Initially, it trains on a curated set of particle regions and a broader selection of unlabeled regions from a reference micrograph, utilizing a self-supervised approach. We introduce a self-cross similarity loss, which ensures that the cryoMAE encoder generates distinct latent features for particle and non-particle areas. Subsequently, the trained encoder analyzes query micrographs, extracting and comparing latent features to exemplar features to ascertain particle locations through similarity scoring.
The performance of cryoMAE was rigorously evaluated using the CryoPPP cryo-EM particle picking dataset [4], showcasing significant enhancements in 3D particle reconstruction resolution. Particles selected by our model from this dataset exhibit up to 22.4% (average 11.1%) improvement in resolution compared to those picked using current SOTA models. Remarkably, these results were achieved using just a few labeled exemplars (e.g., 15) per protein type, highlighting cryoMAE's efficient use of limited data.
Our contributions are summarized as follows:
orientations directly from the training data, making them more adaptable to different datasets without the need for specific templates. CrYOLO [22] and Topaz [1] are distinguished for their advanced particle picking capabilities in cryo-EM. CrYOLO leverages the You Only Look Once framework [15] for particle detection, and Topaz employs convolutional neural networks (CNNs) with positive-unlabeled (PU) learning. Despite their strengths, crYOLO may overlook true particles, while Topaz is prone to recognizing numerous false positives and duplicates [5]. They require extensive labeled datasets, demanding significant time and resources. Our cryoMAE, utilizing few-shot learning, offers high efficiency using a minimal number of exemplars. It effectively reduces false negatives and positives, and minimizes reliance on large labeled datasets, representing a significant leap in cryo-EM particle picking technology.
MAEs were initially introduced by He et al. [7], drawing inspiration from the BERT model [3], a transformative approach in natural language processing. MAEs bring the concept of masking into computer vision: random sections of an image are obscured (masked) before the image is processed by an encoder, and a decoder then attempts to reconstruct the masked sections. He et al. [7] demonstrated that masking a substantial portion of the image (up to 75%) compels the model to learn deeper and more comprehensive representations of the data. In our study, we harness the feature extraction capabilities of MAEs to discern unique features of particles, thereby enhancing the efficiency and accuracy of particle picking in cryo-EM.

Contrastive Learning.
Contrastive learning has been a transformative force in unsupervised learning, concentrating on increasing the similarity between representations of positive pairs while simultaneously differentiating those of negative pairs. Pioneering this approach, the concept of contrastive loss was introduced for dimensionality reduction and embedding learning, aiming to preserve semantic similarity [6]. Further advances have been made with the development of SimCLR [2], which utilizes data augmentation techniques to enhance the robustness of visual representations. Moreover, He et al. [8] introduced Momentum Contrast, a methodology for building dynamic dictionaries in contrastive learning, which refines the application of contrastive loss. This refinement ensures the consistency of the representations for negative samples across the learning process. In our research, we leverage the principles of contrastive learning to develop a contrastive loss mechanism called self-cross similarity loss. This loss enables our model to effectively discriminate between regions containing particles and background regions.

Methodology
In this section, we detail cryoMAE, starting with the definition of the few-shot cryo-EM particle picking problem, followed by our two-stage framework. For a given protein, we first randomly select a reference micrograph $R$ containing the target particles and manually label $m$ particle regions $x^l_i$ as exemplars, $X_L = \{x^l_i\}_{i=1}^{m}$, where $m$ is a small number (e.g., 15). We then randomly crop $n$ additional regions $x^u_j$ from the same micrograph as unlabeled regions, $X_U = \{x^u_j\}_{j=1}^{n}$. The remaining micrographs containing the same particle form the query micrograph set $Q$. Our goal is to leverage the limited set of exemplars $X_L$ and the unlabeled regions $X_U$ extracted from $R$ to detect the particles within $R$ and $Q$.

Overview
As depicted in Fig. 2, our framework unfolds in two distinct stages. In stage 1, cryoMAE is trained using a mixture of labeled exemplars $X_L$ and unlabeled regions $X_U$ from $R$. This training process is guided by both a mean squared error reconstruction loss and a novel self-cross similarity loss, which helps the model distinguish between regions with and without particles. In stage 2, the trained MAE encoder scans query micrographs to identify particles, comparing latent features of regions against those of exemplars to determine similarity scores. Regions with higher similarity scores are identified as more likely to contain particles, facilitating accurate particle picking.
For each protein type represented by multiple micrographs, we select a reference micrograph $R$ with manually annotated regions $X_L$ as exemplars and crop random unlabeled regions $X_U$ from the remaining parts of $R$. As discussed in [1], particle regions are sparse within micrographs, making most unlabeled regions likely non-particle areas. These images are resized to 224 × 224 and further processed into 16 × 16 patches during training, which are then subjected to random masking at a rate of 75%. This process transforms exemplar and unlabeled regions into masked inputs $\tilde{x}^l_i$ for labeled exemplars and $\tilde{x}^u_j$ for unlabeled regions, respectively. The cryoMAE encoder then generates latent features for these regions, denoted as $E(x^l_i)$ and $E(x^u_j)$ respectively. Subsequently, the MAE decoder utilizes the generated latent features to reconstruct the original input images. This reconstruction is achieved through a self-supervised process, with the original images serving as the supervisory signal. The masking encourages the model to focus on global features of cryo-EM images, enhancing its understanding of particle structures and its generalization across conditions. Such a focus is crucial for overcoming the limited training data challenge in the cryo-EM field, improving the model's performance in particle detection and generalization.
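The patching and masking step described above can be illustrated with a minimal NumPy sketch. It assumes a single-channel 224 × 224 region, 16 × 16 patches, and a 75% masking rate as stated in the text; the function name `random_patch_mask` and its return values are hypothetical and are not taken from the cryoMAE implementation.

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, rng=None):
    """Split a 2D image into non-overlapping patches and hide a random subset.

    Returns the visible patches (one flattened patch per row), the indices of
    the visible patches, and the indices of the masked patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    gh, gw = h // patch, w // patch              # 14 x 14 grid for 224 / 16
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))
    # Reorder pixels so that each row holds one flattened patch.
    patches = (image.reshape(gh, patch, gw, patch)
                    .swapaxes(1, 2)
                    .reshape(n_patches, patch * patch))
    order = rng.permutation(n_patches)
    masked_idx = np.sort(order[:n_masked])
    visible_idx = np.sort(order[n_masked:])
    return patches[visible_idx], visible_idx, masked_idx

# Toy region standing in for a resized exemplar or unlabeled crop.
region = np.random.default_rng(0).normal(size=(224, 224))
visible, visible_idx, masked_idx = random_patch_mask(
    region, rng=np.random.default_rng(1))
```

In an MAE-style setup only the visible patches would be passed to the encoder, while the decoder is asked to reconstruct the masked ones.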
Training cryoMAE incorporates both particle and unlabeled regions to bolster model robustness. Exclusive training on particle images could lead the MAE to converge towards a homogeneous latent feature space for any given input, potentially escalating the false positive rate by assigning high similarity scores indiscriminately, including to background regions. By including unlabeled regions, cryoMAE learns to recognize features of non-particle spaces, avoiding overfitting to a solely particle-focused feature space. This broader training approach refines the model's ability to distinguish between particle and background regions, markedly lowering false positive rates by assigning more accurate similarity scores to non-particle areas. However, adding unlabeled regions faces some challenges: 1) the diverse background noise in cryo-EM, ranging from crystalline ice contamination and malformed particles to grayscale background regions, demands a nuanced approach for accurate differentiation; 2) merely incorporating unlabeled data might not prompt the model to learn features unique to particles against complex backgrounds. To optimize the training efficiency of cryoMAE on few-shot particle datasets and reduce overfitting risks, while also accounting for a wide range of background noise, we introduce a pre-training phase. Pre-training cryoMAE on a broader set of unlabeled regions better represents background variability. Further, introducing a self-cross similarity loss specifically addresses these noise issues, enhancing the model's ability to discern particles from backgrounds.

Self-cross similarity.
Drawing from the self-similarity concept [18], we develop a self-cross similarity loss to foster distinct latent features for particles and background within cryo-EM images, enhancing the model's ability to differentiate between these regions. This approach aims to increase the disparity in the feature space, thereby improving the precision of particle identification. As illustrated in Fig. 2, the MAE encoder's latent features are not only used by the decoder for image reconstruction but are also evaluated using the self-cross similarity loss, further detailed in Fig. 3. The self-similarity $S_{self}$ is calculated as the mean cosine similarity among the features of positive regions. Similarly, the cross similarity $S_{cross}$ is the mean cosine similarity between features of positive and unlabeled regions. $S_{self}$ measures the similarity among latent features of exemplars, reflecting the internal consistency of particle features. This is crucial for the model to identify and enhance particle-specific patterns, facilitating better distinction from background noise. The goal is for $S_{self}$ to increase, indicating stronger similarity within particle groups. Conversely, $S_{cross}$ assesses the similarity between exemplar features and those of unlabeled (negative) regions, aiming to capture the distinctiveness between particles and background. The objective is for $S_{cross}$ to decrease, signifying reduced feature similarity between particles and background. The self-cross similarity loss $L_{SCS}$ is designed to optimize these dynamics, thereby improving the model's ability to differentiate between particles and backgrounds. In this loss, $\alpha$ balances the self- and cross-similarity contributions, and $\tau$ sets a minimum difference threshold between them, limiting further distinction efforts beyond it.
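The two similarity terms can be sketched in NumPy as follows. The averaging over exemplar pairs with i ≠ j is an assumption, and the hinge-style `scs_loss` is a hypothetical form chosen only to match the described roles of α (balancing the two terms) and τ (a margin beyond which no further separation is enforced); it is not the paper's equation, and all function names are illustrative.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def self_cross_similarity(z_l, z_u):
    """Mean cosine similarity among exemplars and between exemplars and unlabeled."""
    m = z_l.shape[0]
    sim_ll = cosine_matrix(z_l, z_l)
    # Assumption: the trivial i == j pairs are excluded from the self term.
    s_self = (sim_ll.sum() - np.trace(sim_ll)) / (m * (m - 1))
    s_cross = cosine_matrix(z_l, z_u).mean()
    return float(s_self), float(s_cross)

def scs_loss(s_self, s_cross, alpha=0.5, tau=0.5):
    """Hypothetical hinge form: zero once the weighted gap reaches tau."""
    gap = alpha * s_self - (1.0 - alpha) * s_cross
    return max(0.0, tau - gap)

# Toy latent features: exemplars share one direction, background another.
z_l = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
z_u = np.array([[0.0, 1.0], [0.0, 5.0]])
s_self, s_cross = self_cross_similarity(z_l, z_u)
separated_loss = scs_loss(s_self, s_cross)
# If the unlabeled features look exactly like the exemplars, the loss is positive.
collapsed_loss = scs_loss(*self_cross_similarity(z_l, z_l[:2]))
```

With well-separated features the sketch yields a high self term, a low cross term, and no remaining loss, which is the behaviour the text describes as the training objective.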

PU learning.
Inspired by [1], we identify a limitation in the loss design above, which treats all unlabeled data as negative.
Randomly cropped training regions may unintentionally include particles, potentially confusing the model's distinction between labeled particles and background noise. This overlap complicates training, as the model could wrongly link particle features with the background, undermining our strategy to reduce background similarity scores and challenging the model's ability to learn discriminatively. To enhance the loss formulation, we accommodate the potential inclusion of particles in unlabeled regions. Acknowledging that a certain proportion ($\pi$) of these samples may harbor particles, we modify the representation of features for these unlabeled samples. We adjust the feature representation with a weighting scheme grounded in the estimated probability $\pi$ that an unlabeled region harbors a particle, alongside a complementary weight $1 - \pi$ for regions likely devoid of particles. This probabilistic approach enhances the model's capacity to differentiate between particle-laden regions and pure background, optimizing the use of unlabeled data in training and improving particle identification accuracy. The presence of particles in unlabeled regions necessitates a recalibration of the similarity calculations, which now also consider the self-similarity among potential positives and their cross-similarity with potential negatives within the unlabeled data. Here $S_{ll}$, $S_{lu}$, and $S_{uu}$ measure the sums of cosine similarities among exemplars, between exemplars and unlabeled regions, and among unlabeled regions, respectively. In the formulas, we decide not to adjust $n$ because we treat each latent feature adjustment as a weighted process. Under this logic, we view it as having $n$ latent features adjusted by $\pi$ and $1 - \pi$, rather than having a total of $\pi n$ particle regions or $(1 - \pi) n$ background regions within all unlabeled regions. This keeps the formulation aligned with Fig. 3. The refined self-cross similarity loss, $\hat{L}_{SCS}(\hat{S}_{cross}, \hat{S}_{self})$, captures the complexity of similarity within data subsets. By refining these metrics to account for the characteristics of unlabeled data, we obtain a more discerning and effective training regimen. The total loss of cryoMAE combines the reconstruction loss with the refined self-cross similarity loss, where $\beta$ adjusts the weight of the self-cross similarity loss in the overall loss function, balancing reconstruction accuracy with discriminative learning.

Stage 2: Particle Picking on Query Micrographs
In stage 2, our model undertakes particle picking by utilizing the MAE encoder to scan query micrographs and extract features from each sliding window region, as detailed in Stage 2 of Fig. 2. This stage does not employ masking for the input regions. The extracted latent features are then matched against those of exemplars through cosine similarity, assigning similarity scores to each region based on the highest similarity. Following the completion of the sliding process on a micrograph, these similarity scores are ranked. It is crucial to recognize the variability in the imaging states of different micrographs, where a single threshold does not work well. Therefore, we adopt a density-based method to determine the most suitable cutoff threshold for each micrograph automatically. This process involves calculating the average distance of each score to its k nearest neighbors, and finding the score where the rate of change in these average distances is maximized as the cutoff threshold. Coordinates of all regions with similarity scores exceeding this threshold, along with the micrograph filenames, are recorded in a .star file. The .star format is widely used in cryo-EM to document particle coordinates, aiding in subsequent steps like 3D reconstruction using CryoSPARC.
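The scanning step can be sketched as a plain sliding-window loop. In this illustration `toy_encoder` is a hypothetical stand-in for the trained MAE encoder (it only flattens and mean-centres the window), and the default window size of 224 is an assumption; only the unmasked sliding, the cosine matching, and the use of the highest similarity per region come from the text.

```python
import numpy as np

def toy_encoder(window):
    """Hypothetical stand-in for the trained encoder: flattened, mean-centred pixels."""
    v = window.astype(float).ravel()
    return v - v.mean()

def score_micrograph(micrograph, exemplar_feats, window=224, stride=28,
                     encoder=toy_encoder):
    """Return (top, left, score) per window; score is the highest cosine similarity."""
    ex = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    h, w = micrograph.shape
    results = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            f = encoder(micrograph[top:top + window, left:left + window])
            f = f / (np.linalg.norm(f) + 1e-12)   # guard against zero vectors
            results.append((top, left, float(np.max(ex @ f))))
    return results

# Small toy example: plant one exemplar-like patch in a noisy "micrograph".
rng = np.random.default_rng(0)
patch = rng.normal(size=(16, 16))
micrograph = rng.normal(size=(64, 64))
micrograph[16:32, 24:40] = patch
exemplar_feats = np.stack([toy_encoder(patch)])
results = score_micrograph(micrograph, exemplar_feats, window=16, stride=8)
best = max(results, key=lambda r: r[2])
```

The resulting list of scores is what a per-micrograph cutoff would then be applied to before coordinates are written out.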

Experiments
This section evaluates cryoMAE against SOTA particle picking methods using the CryoPPP dataset, including ablation studies, sensitivity analysis, and qualitative visualizations to demonstrate its effectiveness.
We evaluated cryoMAE using five distinct particle datasets from CryoPPP [4], which were obtained from the Electron Microscopy Public Image Archive (EMPIAR) database [10]. EMPIAR is a publicly accessible resource that offers raw, high-resolution cryo-EM images for research and benchmarking in the field of electron microscopy. The datasets used in our experiments, identified by EMPIAR IDs 10081, 10093, 10345, 10532, and 11056, comprise 300, 300, 300, 300, and 361 micrographs, respectively, each accompanied by particle coordinate information. Each EMPIAR ID corresponds to a unique protein type, facilitating targeted analysis within our SPA framework.

Baselines.
In this study, we utilized crYOLO [22] and Topaz [1], introduced in Section 2, as our baselines. For crYOLO, we employed the general model pre-trained on more than 40 datasets, which can select particles of previously unseen macromolecular species as claimed in [22]. For Topaz, we used a pre-trained model based on ResNet [9] (16 layers, each layer has 64 units) trained on large-scale cryo-EM datasets.

Evaluation metrics.
Our evaluation metrics include precision, recall, and F1 scores. A true positive occurs when a picked particle region overlaps with a ground truth region, achieving an intersection over union (IoU) of 0.5 or higher, with each ground truth accounted for only once. False positives include picked regions that either have an IoU less than 0.5 with any ground truth region or represent multiple detections for a single ground truth. False negatives are ground truth regions that remain undetected.
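The matching rule above can be made concrete with a short sketch. Boxes are assumed to be axis-aligned tuples `(x_min, y_min, x_max, y_max)`, and picks are matched greedily in input order; the text does not specify how competing matches are resolved, so that ordering is an assumption, as are the function names.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(picked, truth, thr=0.5):
    """Precision, recall and F1 with each ground-truth box matched at most once."""
    matched = set()
    tp = 0
    for p in picked:
        best_j, best_iou = None, thr
        for j, t in enumerate(truth):
            if j in matched:
                continue                      # duplicates cannot reuse a match
            v = iou(p, t)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(picked) - tp                     # low-IoU picks and duplicate picks
    fn = len(truth) - tp                      # undetected ground-truth regions
    precision = tp / (tp + fp) if picked else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
picked = [(1, 1, 11, 11), (0, 0, 10, 10), (50, 50, 60, 60)]
precision, recall, f1 = evaluate(picked, truth)
```

In this toy case the second pick is a duplicate of an already matched ground truth and the third overlaps nothing, so both count as false positives, while the second ground-truth box is a false negative.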

Particle picking.
The cryoMAE encoder slides over and processes query images in stage 2 with a stride of 28, extracting features for each sub-region. These features are matched against exemplar features, assigning the highest similarity score to each region. Following the sliding process, scores are ordered, and a density-based approach determines the cut-off threshold by identifying a sharp change in the 5 nearest neighbor average distance list. Coordinates from regions above this threshold are pinpointed as particle locations.
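One possible reading of the density-based cutoff is sketched below. It sorts the scores in descending order, computes each score's mean distance to its 5 nearest neighbours in score space, and places the cutoff at the largest jump between consecutive average distances. Returning the first score after the jump, so that strictly larger scores are kept, is an assumption, as is the function name `density_cutoff`.

```python
import numpy as np

def density_cutoff(scores, k=5):
    """Cutoff at the sharpest change in the k-nearest-neighbour average distance."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]    # descending order
    d = np.abs(s[:, None] - s[None, :])                   # pairwise score distances
    d.sort(axis=1)                                        # column 0 is the self-distance
    avg_knn = d[:, 1:k + 1].mean(axis=1)
    change = np.abs(np.diff(avg_knn))                     # rate of change along the ranking
    idx = int(np.argmax(change))
    # Assumption: scores strictly above the first score after the jump are kept.
    return s[idx + 1]

# Toy scores: a few sparse high scores (particles) over a dense low-score background.
particle_scores = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.70])
background_scores = np.linspace(0.20, 0.219, 20)
scores = np.concatenate([background_scores, particle_scores])
cutoff = density_cutoff(scores, k=5)
picked = scores[scores > cutoff]
```

This reading relies on high-scoring regions being sparse relative to the background scores, which is consistent with the earlier observation that particle regions are sparse within micrographs.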

3D reconstruction.
We utilized CryoSPARC [14] to conduct 3D reconstructions on particles selected by various methods and compared the resolutions of the reconstructed particles. The workflow, from particle picking to reconstructed structure, encompasses essential steps: contrast transfer function (CTF) estimation, 2D classification, 2D class selection, ab initio reconstruction, and homogeneous refinement. CTF estimation corrects for the microscope's phase contrast, crucial for high-resolution reconstructions. 2D classification sorts particles into classes, removing aberrant particles to improve data quality. 2D class selection further ensures only high-quality particles are used, followed by ab initio reconstruction for an initial 3D model creation without prior knowledge.

Overall Performance
The performance comparison of crYOLO, Topaz, and cryoMAE in particle picking is detailed in Tables 1 and 2, with 3D reconstruction outcomes visualized in Fig. 4.

Ablation Studies
Ablation studies validate the contributions of key cryoMAE components: self-cross similarity loss, pre-training, and exemplar similarity matching. We assessed the performance of cryoMAE across different configurations of self-cross similarity loss (without self-cross similarity loss, with unadjusted self-cross similarity loss, and with adjusted self-cross similarity loss) in Table 3, revealing optimal performance with the adjusted loss. This finding highlights the crucial impact of self-cross similarity loss in enhancing feature extraction, making cryoMAE more discerning in particle selection and greatly lowering the chance of incorrect region identification. CryoMAE without self-cross similarity loss incorrectly scores many non-particle regions highly, evident from widespread white areas in Fig. 5(b)(e). In contrast, with this loss, cryoMAE specificity improves, accurately identifying particle regions, as shown in Fig. 5(c)(f), reducing false scores for background areas.
Further insights are shown in Fig. 6, displaying a cosine similarity matrix for 12 regions, including 4 exemplars (1-4) and 8 unlabeled areas (5-12), with region 10 being a particle region. The matrix demonstrates high similarity among particle regions and lower similarity between particle and background regions, highlighting the model's ability to group particle regions closely in the feature space and distinguish them from the background. This is key to the success of the self-cross similarity loss, enabling the model to significantly reduce similarity scores for non-target areas and concentrate high scores on central particle regions, thus reducing false positives. Conversely, models trained without this loss struggle to separate particle regions from backgrounds, leading to increased false positives.
We also conduct 2D t-SNE visualizations to analyze the latent features of cryoMAE under varying conditions: trained on a dataset without unlabeled regions, trained on a dataset with unlabeled regions without the adjusted self-cross similarity loss, and trained on a dataset with unlabeled regions with the adjusted self-cross similarity loss. For each visualization, we randomly select a consistent set of 60 exemplars and 360 unlabeled regions from EMPIAR-10081 to ensure comparability across the three scenarios. The visualizations are in Fig. 7a, Fig. 7b and Fig. 7c, respectively. As demonstrated in Fig. 7a, training exclusively on particle regions leads cryoMAE to generate homogeneous latent features for any input. This approach risks elevating the false positive rate by indiscriminately assigning high similarity scores, including to background regions. Fig. 7b illustrates that incorporating unlabeled regions enables cryoMAE to discern features of non-particle regions, thus mitigating over-fitting to a particle-exclusive feature space. Consequently, the model acquires a preliminary capability to differentiate between particle and background regions, although with limited clarity (as observed in the latent feature space 2D visualization, where the blue and yellow clusters are approximately but not distinctly separated). Further advancements are evident in Fig. 7c, where the introduction of adjusted self-cross similarity loss significantly enhances the model's ability to distinguish between background regions and particles. This improvement is illustrated by the distinct separation between the two clusters in the figure, despite the presence of some yellow points within the blue cluster. These exceptions, representing particle-containing regions within unlabeled areas, are considered reasonable.

Max and mean matching strategies.
Table 5 presents a comparative study on two similarity score calculation methods for matching sliding regions against exemplar latent features: maximum vs. average cosine similarity. Table 5 reveals that maximum cosine similarity outperforms average cosine similarity. This advantage is linked to the varied orientation distributions among particle exemplars. Maximum cosine similarity effectively matches regions to their closest exemplar across different orientations, ensuring optimal scores. Conversely, average cosine similarity dilutes scores for particles with diverse orientations, as it averages across all exemplars, including those with markedly different particle orientations from the target region. This dilution lowers similarity scores for such particles, reducing their distinctiveness from the background and making accurate particle identification more challenging amidst noise.
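The dilution effect can be reproduced with a toy example. The three orthogonal exemplar vectors below are hypothetical stand-ins for latent features of three distinct particle orientations; they are not real cryoMAE features and are chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def match_scores(region_feat, exemplar_feats):
    """Return (max, mean) cosine similarity of one region against all exemplars."""
    ex = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    f = region_feat / np.linalg.norm(region_feat)
    sims = ex @ f
    return float(sims.max()), float(sims.mean())

# Three toy exemplar features standing in for three particle orientations.
exemplars = np.eye(3)
particle = np.array([0.0, 1.0, 0.1])      # close to the second orientation only
background = np.array([1.0, 1.0, 1.0])    # weakly similar to every exemplar

p_max, p_mean = match_scores(particle, exemplars)
b_max, b_mean = match_scores(background, exemplars)
```

Under maximum matching the particle scores well above the background, whereas under mean matching its score is pulled down by the two dissimilar exemplars and falls below the background score.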

Sensitivity Analysis
In this section, we conduct a sensitivity analysis to examine the impact of varying the number of exemplars and the sliding stride on model performance. Table 6 shows how the model performance of cryoMAE varies with the number of exemplars used. As expected, adding more exemplars generally improves performance, owing to a more comprehensive representation of particle orientations in the similarity scoring process. This is particularly beneficial for particles with diverse orientations, as more exemplars increase the chance of capturing regions across different orientation states, improving recall. However, the performance improvement plateaus after a certain number of exemplars, with precision potentially decreasing. This is because particle orientations are limited, and once the diversity of these states is adequately covered, additional exemplars offer little benefit and may even raise false positives by increasing the likelihood of background regions being mistakenly scored highly. Thus, considering the diminishing returns beyond 15 exemplars, we identify this count as the optimal number for our few-shot learning approach.
Table 7 outlines the model performance of cryoMAE across various sliding strides, noting that decreasing the stride from 56 to 14 typically boosts recall but diminishes precision. This trend can be attributed to the fact that larger strides cause a given particle to be present in fewer windows, minimizing duplicate detections and enhancing precision. However, larger strides can also result in lower similarity scores for many particles, as they are more likely to be close to window edges, which can reduce their likelihood of being selected and decrease recall. The F1 score, the harmonic mean of precision and recall, tends to improve with smaller strides. Yet, reducing the stride size significantly lengthens processing time per query image. Considering the trade-off between time efficiency and model accuracy, a 28-pixel stride is identified as the best balance.

Conclusion
We introduce cryoMAE, a pioneering approach in few-shot learning tailored specifically for the cryo-EM field, significantly reducing the dependence on extensive labeled datasets for accurate particle picking. By harnessing the power of MAE and integrating a novel self-cross similarity loss, cryoMAE achieves superior performance in identifying particle-containing regions amidst the challenges posed by low SNR and diverse particle orientations. Validations on the CryoPPP dataset demonstrate cryoMAE's superiority over existing NN-based methods, marking a significant advancement in the cryo-EM analysis pipeline. This innovation not only streamlines the process of high-resolution protein structure determination but also makes it more accessible to a wider scientific audience, promising to accelerate discoveries in structural biology.

Figure 1: In cryo-EM with SPA, electron beams capture numerous 2D images of proteins within a cryogenically preserved sample. These images are subsequently denoised and subjected to particle picking, facilitating the reconstruction of the 3D structure of the protein.

Figure 2: Overview of the two-stage cryoMAE framework: stage 1 illustrates the training phase with a mix of labeled particle and unlabeled regions, employing reconstruction loss and self-cross similarity loss. Stage 2 depicts the particle picking process, where the trained MAE encoder assesses query micrographs, leveraging latent feature comparisons to identify particle positions accurately.

Figure 7: (a) trained on a dataset w/o unlabeled regions. (b) trained w/o the adjusted self-cross similarity loss. (c) trained w/ the adjusted self-cross similarity loss.

Table 1 reveals the high precision of crYOLO but its tendency to

Table 1: Performance comparison of cryoMAE, crYOLO, and Topaz on CryoPPP.

Table 2: Ab-initio reconstruction resolution comparison of cryoMAE, crYOLO, and Topaz across EMPIAR datasets from CryoPPP.

Table 3: Comparison of cryoMAE w/o self-cross similarity loss, w/ unadjusted self-cross similarity loss $L_{SCS}$, and w/ adjusted self-cross similarity loss $\hat{L}_{SCS}$.

Table 4: W/ and w/o pre-training.

Table 5: Max and mean matching.